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Abstract. Bayesian model comparison requires the specification of a 
prior distribution on the parameter space of each candidate model. 
In this connection two concerns arise: on the one hand the elicitation 
task rapidly becomes prohibitive as the number of models increases; 
on the other hand numerous prior specifications can only exacerbate 
the well-known sensitivity to prior assignments, thus producing less 
dependable conclusions. Within the subjective framework, both diffi- 
culties can be counteracted by linking priors across models in order to 
achieve simplification and compatibility; we discuss links with related 
objective approaches. Given an encompassing, or full, model together 
with a prior on its parameter space, we review and summarize a few 
procedures for deriving priors under a submodel, namely marginaliza- 
tion, conditioning, and Kullback-Leibler projection. These techniques 
are illustrated and discussed with reference to variable selection in lin- 
ear models adopting a conventional g-prior; comparisons with existing 
standard approaches are provided. Finally, the relative merits of each 
procedure are evaluated through simulated and real data sets. 

Key words and phrases: Bayes factor, compatible prior, conjugate 
prior, g-prior, hypothesis testing, Kullback-Leibler projection, nested 
model, variable selection. 



1. INTRODUCTION 

Model comparison is an important and active area 
of research especially from the Bayesian viewpoint; 
see, for example, George (1999) and Robert (2001, 
Chapter 7). In particular, the problem of variable 
selection in linear models has received considerable 
attention; see the review paper of George (2000) and 
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a few survey chapters in the book edited by Dey 
and Rao (2005). Two critical issues emerge from the 
very beginning: the elicitation of prior probabilities 
for the various models under consideration and the 
assignment of prior distributions on the parameter 
space of each model, which we simply call priors. In 
this paper we focus on the latter. 

Occasionally, when the model space is not large 
and detailed prior information is available, subjec- 
tive prior elicitation on each model can be carried 
out; see Garthwaite and Dickey (1996). More often, 
however, because of the potentially very high num- 
ber of models under investigation, prior elicitation 
can represent a formidable task, and hence practi- 
cally implementable procedures have been actively 
looked for. In the objective framework (see Berger 
and Pericchi, 1996b), a convenient approach is to 
start with a default, typically improper, prior under 
each model, and then to circumvent the indetermi- 
nacy of the normalizing constant through an intrin- 
sic prior procedure (see also Casella and Moreno, 
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2006, for an application to variable selection in lin- 
ear models). A more general approach, namely ex- 
pected posterior prior, is described in Perez and 
Berger (2002). 

Outside the purely objective view, pragmatic sim- 
plification of the elicitation task in the variable se- 
lection problem has been achieved through hierar- 
chical mixture priors as in George and McCulloch 
(1997), or using an empirical Bayes approach, as 
in George and Forster (2000), and more recently in 
Yuan and Lin (2005), or employing a blend of nonin- 
formative and conjugate procedures, as exemplified 
in Fernandez, Ley and Steel (2001). Recently Liang 
et al. (2008) have proposed mixtures of g-priors as 
an efficient tool for Bayesian variable selection. 

Within the subjective framework, which uses proper 
priors, the idea of relating priors across models does 
not seem to be pervasive. Notable exceptions are 
Dickey (1971) and Poirier (1985), in the context of 
linear models; see also the discussion in O'Hagan 
and Forster (2004, Sections 11.29-11.31). Neal (2001) 
introduces the idea of transferring prior information 
from a "donor model" to a "recipient model." His 
motivation is primarily pragmatic: priors for com- 
plex models are harder to elicit than those for simple 
models; accordingly one can try to carefully elicit a 
prior under a simple "donor" model and then trans- 
fer this information to a complex "recipient" model. 
Technically Neal's method is similar to, although 
more general than, the expected posterior prior of 
Perez and Berger (2002). The paper by Dawid and 
Lauritzen (2001) stands out as an attempt to dis- 
cuss, in a general setting, methods to construct "com- 
patible priors" for nested models using a variety of 
strategies. Their motivation is mixed: on the one 
hand they state that conceptually there is no com- 
pelling reason to relate priors across models (since 
they express subjective opinions conditionally on a 
different state of information); on the other hand 
such relationships may be highly desirable on prag- 
matic grounds (the effort spent in eliciting a prior 
under a model should somehow be transferred to 
other models) and also to achieve some sort of com- 
patibility in order to lessen the sensitivity of the 
Bayes factor to prior specifications. 

Following up this comment, we believe that pri- 
ors for model comparison deserve to be carefully in- 
vestigated by the Bayesian community. Traditional 
priors, which individually perform quite effectively 
within a single model, need not work satisfactorily 
when collectively employed for comparing models of 



varying dimensions. This fact has been informally 
recognized at least since Jeffreys, who refrained from 
using conventional priors for comparing two nested 
hypotheses; see also Zellner and Siow (1980) in the 
framework of linear models. 

In the context of comparing a sharp null hypoth- 
esis Hq versus a composite alternative H, Morris 
(1987) argued forcibly for the prior under H to be 
"centered around Hq"; otherwise the prior under H 
would be "wasting away" prior probability mass in 
regions that are often too unlikely to be supported 
by the data, thus unduly favoring Hq, as lucidly 
spelled out in Casella and Moreno (2007); see also 
Consonni and La Rocca (2008). Carefully extend- 
ing this argument to several models would surely be 
of great value and interest in order to enhance our 
understanding of the issue of compatibility of priors 
for model comparison. While this paper falls short of 
providing a comprehensive treatment of this point, it 
nevertheless tries to offer some guidance for further 
reflection and research. Specifically, we try to eluci- 
date the meaning of the term "submodel," or nested 
model, in order to highlight differences between a 
couple of approaches which are implicit in the lit- 
erature and better understand specific strategies to 
relate priors across models. Although the scope of 
our considerations is general, we will illustrate the 
main ideas with reference to the problem of variable 
selection in linear models. 

The structure of the paper is as follows. Section 
2 deals with two notions of nested models and dis- 
cusses the corresponding parametrization, distinguish- 
ing between nuisance and common parameters. Sec- 
tion 3 deals with strategies to assign priors on pa- 
rameters of submodels starting from a prior on the 
(full) model; we discuss conditioning and projection 
(including marginalization) and propose, in Sections 
3.1 and 3.2, two criteria to evaluate such strategies, 
which we name nuisance- and nested-coherence. Sec- 
tion 4 deals with priors for linear models. Starting 
with a g-prior under the full model, a variety of prior 
specifications on submodels is obtained through the 
procedures described in Section 3; in particular Sec- 
tion 4.2 contains a discussion of the so-called "infor- 
mation paradox." Section 5 presents three examples 
to evaluate the performance of the various priors 
under consideration in terms of model comparison, 
with special references to sensitivity issues. Finally, 
Section 6 provides a few points for discussion. To 
ease the flow of ideas, technical aspects have been 
relegated to the Appendix. 
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2. SUBMODELS 

2.1 A Preliminary Example 

We start by discussing a very simple example with 
the aim of presenting the main issues at stake. Con- 
sider the following model: 

M: yi = a + f3xi+Ei, i = l,...,n, 

(a, (3, a 2 ) G = M. x R x M. + 

where, conditionally on a 2 , Ei ~ N(0,cr 2 ). An obvi- 
ous submodel, say M* , removes the predictor, thus 
changing the mean structure. However, several in- 
stances of M* are available, namely: 



M* A : 


Vi 


= a + £i, 




M%: 


Vi 


= » + £*, 


£i ~N(0,a* 2 ); 


M* c : 


Vi 


= a* + Ei, 


^~N(0,a 2 ); 


M* D : 


Vi 


= a* + £*, 


£ *^N(0,<7* 2 



Model M* A originates in the setting of hypothesis 
testing postulating that (3 = under M; in other 
words, M* A is equivalent to the hypothesis H* : y ~ 
M and /3 = 0. As a consequence the parameters a 
and a 2 are "common" to both models, although 
one might further distinguish between them, since 
a 2 pertains to the error structure (which is not af- 
fected explicitly by the submodel specification), and 
thus can be regarded as a "nuisance" parameter. 
Model Mh originates from the consideration that 
the error component in the submodel might, and 
perhaps should, be allowed to be different from that 
under M. In particular, since one can anticipate 
a worse fit under M* B than under M, one should 
have E(a* 2 ) > E(a 2 ) or even a* 2 > a 2 (with prob- 
ability 1). Model M* c originates from the consider- 
ation that the meaning of the intercept is actually 
quite different under the two models, and so should 
be distinct from that under M. On the other hand 
a 2 remains the same, since it is regarded as a "nui- 
sance" parameter. Finally model Mh combines the 
specific features of M* B and M* c , and has no di- 
rect link, unlike the previous versions, to M. For a 
related discussion on alternative interpretations of 
submodels, see Berger and Pericchi (2001, Section 
1.5, "Difficulty 4"). 

In an abstract sense, all instances of M* above 
represent the same submodel, since they share the 
same family of distributions. However, the distinc- 
tive features that we have tried to underline should 



make it clear that they are different objects, or per- 
haps different ways of looking at the same object. 
For a given prior it on (a, (3, a 2 ) under M, we re- 
quire a prior, ir* say, under M* . We claim that each 
instance of M* naturally suggests a different proce- 
dure to obtain it* from ir. 

Consider first model M* A . There are two natural 
candidates for ir* , namely 7r(a, a 2 ) and ir(a, a 2 \(3 = 0), 
that is, the marginal and the conditional (on (3 = 
0) distribution derived from ir(a, (3, a 2 ). The lat- 
ter might appear more natural, if the hypothesis- 
testing interpretation of M* A is strictly adhered to. 
Note that the two procedures lead to the same pri- 
ors if {a, a 2 ) is independent of (3, as it occurs us- 
ing default priors. For model M* B , instead, no obvi- 
ous indications are provided for the specification of 
7r*(<7* 2 ); on the other hand, since a is "common" to 
both models, a natural suggestion would be to take 
ir* (a) = ir(a). Of course the problem of combining 
the two marginal distributions into a joint one re- 
mains open. Under model M* c a situation somewhat 
similar to that under M* B obtains, if we interchange 
the role of the intercept and the variance. Finally, 
neither marginalization nor conditioning appears as 
obvious recommendations under M* D , because no 
effective link with M is specified. The next sections 
explore these issues in greater generality. 

2.2 Nested Models 

It could be argued that each of the models M* 
described in Section 2 is nested in M. However, we 
feel some other clarification is needed. 

Consider a model M = {f(-\9),6 G ©}. There seem 
to be two interpretations of a nested model M* in 
the literature, often not clearly distinguished. Both 
start from the assumption that it is possible to write 
= (A, 0), where A € A and eft G <I>, with A and be- 
ing variation-independent, so that O = A x $ and 
model M* is identified through the constraint = 
4>o, with 0o a fixed value. As suggested by a ref- 
eree, this setting covers only the case in which the 
parameter space O* associated with M* has dimen- 
sion strictly smaller than that of M, and thus it 
does not account for other interesting nesting situ- 
ations in which dim(0*) = dim(O) (e.g., when O* 
is a restriction of O). However, the above (A,0)- 
representation is especially useful from the perspec- 
tive of "prior assignment" under submodels, which 
is the primary focus of this paper. We describe these 
two interpretations below. 
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S-N (Strongly nested interpretation): The sampling 
distribution of y under Ai* is given by /*(-|A), A € A, 
where /*(-|A) = /(-|A,</> = (fro)- This interpretation 
can be clarified in terms of the underlying gener- 
ating process of y: "If Nature chooses A E A and 
<p = 4>q , then the distribution of the observables un- 
der A4 and A4* is the same." 

W-N ( Weakly nested interpretation): The sampling 
distribution of the observations y under M* can be 
written as /*(-|7), 7 € A, with /*(-|7) = /(-|A = 
"f,(f> = 4>o)- In this way 7, although structurally equiv- 
alent to A, is distinct from it. Clearly, each distribu- 
tion in A4* also belongs to A4. 

Interpretation S-N is rooted in a hypothesis-testing 
context, that is, H* : 4> = 4>o, where the actual objec- 
tive of the analysis is verifying whether = 0q , other 
things being held equal. On the other hand, W-N is 
better suited when the objective is model simplifi- 
cation, and each model competes against the other 
ones according to whatever criterion is deemed to 
be appropriate (e.g., a combination of fit and par- 
simony, or on predictive grounds; see, e.g., Gelfand 
and Ghosh, 1998 and Marriott, Spencer and Pet- 
titt, 2001). With regard to the example in Section 
2.1, M\ is the only instance of M* that falls un- 
der interpretation S-N. The S-N view is probably 
the most pervasive and is regarded natural 
framework by, for example, Poirier (1985), O'Hagan 
and Forster (2004, Section 7.15) and Davison (2003, 
page 127). It seems implicit in George and Forster 
(2000) and other workers mostly interested in com- 
putational aspects, for example, Smith and Kohn 
(1996), Nott and Green (2004) and Cripps, Carter 
and Kohn (2005). On the other hand, authors like 
Berger and Periccchi (1996a) and also Robert (2001, 
Section 7.2) seem to prefer interpretation W-N. 

Within the interpretation S-N, consider a collec- 
tion of submodels and suppose that, for each 
A4fc, there exists a reparametrization of A4 as 
(S,r]k,ujk), so that A4fc is identified by = 77^0 ■ Since 
5 is never involved in any submodel specification we 
can regard it as a nuisance parameter; on the other 
hand we call u>k the parameter common to the pair 
(M,Mk)- In the setting of variable selection for lin- 
ear models, the nuisance parameter is clearly rep- 
resented by the error variance a 2 , while common 
parameters are the regression coefficients that are 
not set to zero in the submodel specification. 



We close this section with a caveat that hopefully 
will not disconcert the reader. Despite our insistence 
on model interpretation and parametric description, 
we emphasize that what matters in a Bayesian anal- 
ysis is the prior distribution attached to the pa- 
rameters of the various models regardless of their 
formal representation. The latter, however, may be- 
come relevant when structuring prior specification 
across models. This is the topic of the next section. 

3. STRATEGIES TO ASSIGN PRIORS ON 
PARAMETERS OF SUBMODELS 

Within the objective Bayesian framework, the ex- 
pected posterior prior (EPP) methodology of Perez 
and Berger (2002) is a method to construct prior 
distributions for model comparison; see also Neal 
(2001) for related concepts. The idea is to start with 
a prior distribution under each model, compute its 
posterior under "imaginary" observations, and for- 
mally average the posterior through a marginal data 
distribution that is common to all models. The method 
is quite general, but is especially effective if one 
starts with a default, possibly improper, prior under 
each model. In this way the EPP method allows to 
use improper priors for model comparison through 
Bayes factors, or posterior model probabilities, since 
the indeterminate normalizing constants cancel out. 
More generally, EPP is a method to make priors 
"compatible" across models, through their depen- 
dence on a common marginal data distribution; thus 
this methodology can be applied also with subjec- 
tively specified (proper) prior distributions. 

Although appealing and flexible, implementing the 
EPP methodology may be problematic. First of all 
the choice of the common distribution is not unique. 
For instance, there exist at least two competing choices, 
namely that corresponding to the "simplest" model, 
if it exists, and that corresponding to the empirical 
distribution, which requires the identification of a 
minimal training sample; see Berger and Pericchi 
(2004) for a discussion of potential difficulties asso- 
ciated to this concept. More importantly, to judge 
the relative merits of the above two choices is not 
straightforward. A second concern refers to the ac- 
tual implementation of the EPP, which may require 
careful computational strategies. 

A more specific approach is the intrinsic prior 
methodology, which has received a great deal of at- 
tention both for hypothesis testing and for model 
selection. Again the primary motivation is the use of 
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default noninformative priors under each model; see 
Pericchi (2005) for a review. When several models 
are entertained the intrinsic method requires a nest- 
ing strategy. One approach, labeled "encompassing 
from above," chooses as benchmark a full model 
wherein all other models are nested. In this way, 
however, the prior under the full model changes in 
each pairwise comparison, thus producing an overall 
incoherent probabilistic answer. Yet posterior prob- 
abilities can still be formally defined on the basis of 
the collection of Bayes factors of each model relative 
to the full one; see Casella and Moreno (2006) for 
an application to variable selection in linear models. 
On the other hand, if the simplest model (i.e., one 
being nested within any other model) is available, 
an alternative "encompassing from below" intrinsic 
prior procedure can be followed, which is probabilis- 
tically correct; for an application to variable selec- 
tion see Moreno and Giron (2007). Notice that the 
two alternative encompassing procedures will typ- 
ically lead to distinct answers. As with the EPP 
methodology, analytic evaluation of intrinsic priors 
is typically very hard and actual implementation of 
the procedure requires a good deal of computational 
ingenuity; see Casella and Moreno (2005) in the con- 
text of contingency tables. 

Although the EPP and intrinsic prior method- 
ologies produce priors that are "related" through 
a common underlying marginal data distribution, 
they do not explicitly address the issue of prior com- 
patibility across models. The latter issue is lucidly 
tackled in Dawid and Lauritzen (2001), who present 
several strategies for the derivation of compatible 
priors; see also Roverato and Consonni (2004) in 
the context of directed graphical models and Con- 
sonni, Gutierrez-Pefia and Veronese (2007) for gen- 
eral exponential families with a detailed application 
to testing the Hardy-Weinberg model in studies of 
population genetics. 

Starting with a model Ai = {f(y\X, </>)} and a joint 
distribution 7r(A,</>), we briefly review below four 
main strategies for prior specification under a nested 
model Ai* identified through = </>o- 

Marginalization (M) . This approach is most natu- 
ral under interpretation S-N where Ai* = {f*(y\X), 
A € A}, so that Ai and Ai* share the same param- 
eter A, and states that 7r M (A) = vr(A), where 7r(A) is 
the marginal of A under 7r(A, (j>). Two critical aspects 
should be taken into consideration: (i) marginaliza- 
tion does not explicitly take into consideration the 



constraint <j> = (j)Q ; in fact it disregards this informa- 
tion by averaging with respect to the distribution 
of (ft; (ii) on a more technical side, this procedure 
is not invariant to reparametrization. Consider, for 
instance, model Ai of Section 2.1, and suppose to 
recenter the data as x% —> Xi — x, with x the mean 
of the X{. The model Ai becomes (a — fix) + f3xi 
suggesting the following reparametrization: (a, (3) t— > 
(7,(5), where 7 = — fix, and 5 = fi. Notice that a 
and 7 are the same quantities under Ai* and so 
should share the same prior under the latter model. 
On the other hand, a and 7 are distinct under Ai 
and will have typically different priors, a feature 
which will be inherited under Ai* through the pro- 
cedure M, thus establishing its lack of invariance. 

Usual conditioning (UC). As with M, this pro- 
cedure applies more naturally under interpretation 
S-N, and states that 7r uc (A) = vr(A|0 = ^>o)> where 
the right-hand side is the conditional distribution 
of A given 4> = 4>q under 7r(A,</>). A clear advantage 
of UC is that it incorporates explicitly the infor- 
mation available in the specification of model Ai* , 
through the constraint <j) = <pQ. The major drawback 
of UC is that it is not invariant to the choice of 
the conditioning function (typically an event having 
zero probability) which identifies the submodel. For 
instance, assume that Ai is as in Section 2.1, and 
that (a,/3) are jointly normal with zero mean, vari- 
ances <7q, <x| and correlation coefficient p. Then the 
distribution of a given (3 = is normal with zero 
mean and variance <7q(1 — p 2 ). On the other hand, 
model Ai* could also be identified through the con- 
straint £ = 0, where £ = fi/a. It can be checked that 
the conditional distribution of a given £ = is no 
longer normal. This represents an instance of the 
Borel-Kolmogoroff paradox. 

Jeffreys conditioning (JC). This procedure is a 
variation of UC and hence is most appropriate again 
under interpretation S-N. It was proposed by Dawid 
and Lauritzen (2001) to overcome the lack of in- 
variance of UC. First recall that the density ob- 
tained through UC can be expressed as 7r u (9) cx 
tt(0), ^9*, where G* = {(A, </>), A € A, <f> = fa}. 
Now let H (9) denote the Fisher information matrix 
for 9 under A4, and similarly for H*(9) under Ai* . 
Set j(9) oc \H{9)\ 1 / 2 , where \H\ is the determinant 
of H, so that j{9) is the Jeffreys prior for 9 under 
Ai, and define analogously j*{9) under model Ai* . 
The JC density is defined as 

(1) 7r JC (0)cxvr(^)^^, 9eQ*. 
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Typically, one would re-express the JC density as 
a function of A only, and write 7r JC (A) accordingly; 
we shall follow this style in the next section. A use- 
ful feature of Jeffreys conditioning is invariance to 
model reparametrization, because of the multiplica- 
tive term given by the ratio of the Jeffreys densities. 
A potential difficulty with Jeffreys conditioning is 
that the resulting prior 7r JC (A) may be improper 
even though 7r(0) is proper, because of its nonprob- 
abilistic nature. 

Kullback-Leibler (KL) projection. This procedure 
is part of a more general approach to the construc- 
tion of priors on related models based on projec- 
tion maps, and is especially appropriate under in- 
terpretation W-N. Consider a model Ai and a sub- 
model A4* , parametrized by 9* £ 0* for the same 
observable, and suppose that each distribution in 
At has an image in At* through the (projection) 
map t : i— > 0*. Given a prior ir(0) on 0, the prior 
induced on t{9) is called the r-projection prior. 

For reasons to be specified shortly below, we shall 
take t(6) as the Kullback-Leibler (KL)-projection 
of 9 onto 0*, that is, 

T, KL (0) = arg min KL(f (-\9) , f* (-\9*)) , 

where 

* i(p , 9 ) = ^(loggi) 

denotes the KL-divergence between the density p 
and q relative to a common dominating measure. In 
this case we call the resulting prior KL-projection 
prior, or KL-prior for short, and denote it with 
7T KL (0*), that is, tt kl (6*) =7^ ± (r), where tt* ± is 
the prior on Q 1 - = t^ l (9) induced from the prior 
tt{9). KL-priors were originally presented in McCul- 
loch and Rossi (1992) to compute Bayes factors; 
they are applied in Viele and Srinivasan (2000) to 
ANOVA models, and in Consonni, Gutierrez-Peha 
and Veronese (2007) to a particular multinomial 
model. Goutis and Robert (1998) and Dupuis and 
Robert (2003) use KL-projection for comparing mod- 
els, but do not rely on the idea of KL-priors. 

Notice that KL(p, q) is not symmetric. The intrin- 
sic discrepancy between p and q, 5(p, q) = mm{KL(p, 
q), KL(q,p)} (see Bernardo and Rueda, 2002), over- 
comes this difficulty. However, we will still use KL(p, 
q) because (i) we take p as the encompassing model, 
whose validity is not questioned within our approach, 
while q is a simplified version of p; from this point 
of view taking expectations with respect to p, as 



in KL(p,q), appears a sensible procedure; (ii) for 
regular nested models (wherein the support is inde- 
pendent of the parameter), p and q have the same 
support so that KL(p,q) is well defined; (iii) the use 
of 5(p,q), instead of KL(p,q), adds complexity from 
an analytical viewpoint (for a detailed discussion 
on these points see Consonni, Gutierrez-Peha and 
Veronese, 2007). 

From our perspective, a very important feature of 
the KL-projection is its invariance to reparametriza- 
tion. Thus if rj = g(9) is a reparametrization under 
Ai, then T^ h (ri) = r^ L {g~ x (rj)) . Accordingly, prior 
assignments based on KL-projection do not depend 
on the specific parametrization that is chosen. To 
illustrate the KL-procedure, consider the simple lin- 
ear model At of Section 2.1 with the submodel speci- 
fied by Ai* D . It can be checked that the KL-projection 
of (a, (3, a 2 ) onto the space {(a*, a* 2 ) £ R x R + } is 
given by 

(a, 0, a 2 ) 1 - = ( a + /3x, a 2 + /3 2 - ^(a* - x) 2 J 



with some abuse of notation for the latter equal- 
ity. It is interesting to remark that the projection 
corresponding to the variance is given by a 2 plus a 
quadratic term: as a consequence a* 2 is stochasti- 
cally larger, under the KL-prior, than a 2 , whatever 
the prior on a 2 under At. This seems to be consis- 
tent with the views of those authors who state that 
a* 2 should perhaps be larger than a 2 , to account for 
an anticipated worse fit of the submodel; see Berger 
and Pericchi (2001, Section 1.5) and Robert (2001, 
page 349). A similar, although less stringent, view is 
held by George and McCulloch (1997) according to 
whom the expectation of a 2 under the smaller model 
should be larger. The exact form of the joint KL- 
prior for (a*, a* 2 ) is typically unavailable because 
of the complicated structure of cr 2J -; however, we 
will provide an analytical approximation in the next 
section. Alternatively, one could resort to stochastic 
simulation since a draw from 7r KL (-) can be easily 
obtained by first generating 9 from ir(-) and then cal- 
culating r^ L (0), possibly through numerical meth- 
ods. 

3.1 Coherence of Procedures With Respect to 
Nuisance Parameters 

In this section we plan to evaluate the procedures 
to construct priors under submodels from the point 
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of view of coherence with respect to the nuisance 
parameter as defined in Section 2.2. 

If 5 is a nuisance parameter, then it could be in- 
tegrated out from the very beginning (see O'Hagan 
and Forster, 2004, Sections 3.13-3.14), using a prior 
under M. . A new integrated model IA4 would then 
be obtained, which in turn generates an integrated 
submodel XM* . Let y be a future observation to 
be forecast. We say that a procedure is nuisance- 
coherent if the marginal distributions of y under 
submodel A4* and the corresponding integrated sub- 
model lAi* are the same, that is, 

(2) fM-(v)=fxM-iv)- 

In other words, integrating out the nuisance parame- 
ter "at the beginning" (using it) or "at the end" (us- 
ing the procedure-induced prior) does not make any 
difference. If (2) holds, then the predictive distribu- 
tions under the two models are equivalent; moreover, 
the Bayes factor for the pair (A4,A4*) coincides with 
that for (1M,1M*), since f M (y) = HmW) by def- 
inition of integrated model. 

The following proposition establishes results on 
nuisance-coherence for the procedures M, UC and 
JC. 

PROPOSITION 1. Consider a model M parame- 
trized by (\,5,<j)) with 5 a nuisance parameter, and 
prior ir(\,5,(j)). Let Ai* be a submodel identified 
through <j) = fa . Then: 

(i) the UC procedure is nuisance-coherent; 

(ii) the M procedure is nuisance- coherent if 5 is 
conditionally independent of (p given A under tt(X,S, 

<t>); 

(hi) the JC procedure is nuisance- coherent if the 
ratio of the Jeffreys priors relative to the pair (A4,A4* 
is proportional to that for the pair (IA4,IM*), pro- 
vided the resulting priors are proper. 

Proof. See the Appendix. □ 

In general nuisance-coherence does not hold for 
the KL-procedure; see Section 4.1.3. 

3.2 Coherence of Procedures Across Nested 
Models 

We now address the issue of coherence across a 
collection of submodels. It is actually enough to con- 
sider only three models. For simplicity of exposition 
we shall formulate the problem within interpreta- 
tion S-N (see Section 2.2). Specifically, consider the 



following models: 

(3) M: f(y\X,fa,fa), 

(4) M*: r(y\\,fa) = f(y\\,fa = ct> 1 ,fa), 

M**-. r*(v\>) = f(v\\<h = <l>i,<h = <f%) 

(5) 

= f*(v\\,<h = $), 

so that M* is a submodel of M and M** is a sub- 
model of A4* (and so also of M). Let 7r(A, fa, fa) 
be the prior under Ai, vr*(A, fa) that under M.* and 
finally 7r**(A) that under A4** . For each given pro- 
cedure to construct priors on submodels, the prior 
7r**(A) can be obtained either with respect to the 
pair (A4,A4**), which we label 7r^(A), or with re- 
spect to the pair (Ai*, Ai**), which we label 7r^* (A). 
We say that a procedure is nested- coherent if 

Proposition 2. Consider the three models de- 
scribed in (3)— (5). The M, UC and JC procedures 
are nested- coherent. 

Proof. See the Appendix. □ 

We remark that nested-coherence fails in general 
for the KL-procedure as we report in Section 4.1.3 
with reference to linear models. 

4. LINEAR MODELS 

Consider the general linear model Ai 

(6) y = X(3 + e, 

where y is an n-dimensional vector of observations 
on the dependent variable, X an (n x p) matrix of 
predictors having rank p, j3 a p-dimensional vector 
of regression coefficients and e an n-dimensional vec- 
)tor of error terms with e ~ N(0, o~ 2 I), conditionally 
on a 2 . We assume that the constant term is always 
included in the model, so that the first column of X 
is the unit vector. It is useful to think of (6) as the 
full model. 

If subjective information is limited, we can easily 
resort to conventional proper priors such as the con- 
jugate normal inverted gamma (NIGa) family; see, 
for example, O'Hagan and Forster (2004, Section 
11.4). Specifically, under a NIGa(6, V, d, a) prior, the 
conditional distribution of (3 given a 2 is N(6, a 2 V) 
while the marginal distribution of a 2 is IGa(d/2,a/2). 
Here, N(6, S) denotes a normal distribution with ex- 
pectation b and variance matrix S, while IGa(c£/2, a/2) 
stands for an inverted gamma distribution having 
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expectation a/(d — 2),d > 2. In many applications, 
and especially in econometric analysis, a simplified 
version of the NIGa prior is usually considered. The 
suggestion of Zellner (1986), called g-prior, is to set 
V = g(X T X)~ l , with g > 0. The choice of g has been 
extensively analyzed in several papers, for example, 
George and Foster (2000), Clyde and George (2004) 
and Fernandez, Ley and Steel (2001). 

Some authors have raised criticism against the use 
of g-priors for model selection (see for a clear exposi- 
tion Berger and Pericchi, 2001), and have suggested 
alternative conventional priors, such as the Cauchy 
prior by Zellner and Siow (1980), recently discussed 
in Bayarri and Garcia-Donato (2007). Liang et al. 
(2008) propose to use a prior on the parameter g 
leading to a mixture of g-priors, which includes as 
a special case that by Zellner and Siow. This prior 
does not suffer from the "information paradox" which 
represents a major drawback of g-priors; see Sec- 
tion 4.2. However, we still employ a g-prior on the 
full model because of its simplicity and analytical 
tractability. At any rate the compatible priors that 
we derive under the various submodels differ from 
the g-priors traditionally employed. 

We take as prior for (/3,<r 2 ) under A4 

(7) tt(/3, a 2 ) = NIGa(/3, a 2 ; b, g{X T X)~\ d, a), 



hierarchically specified through 
(8) 



T-)=N(p;b,ga 2 (X T Xy 1 ); 



ir(a 2 ) = IG&(a^;d/2,a/2), 

and refer informally to (7) as the gNIGa prior. 

Concerning the choice of E(j3) = b, three default 
options are 

(9) # = (0,...,0), b T = (y,0,...,0), b = P, 

where (3 represents the OLS estimate of (3 under the 
full model. In this way the elicitation of the gNIGa 
prior reduces simply to choosing the three hyper- 
parameters d, a and g. Possible choices for g are 
extensively discussed in Fernandez, Ley and Steel 
(2001). In particular, based on simulation results, 
they recommend using g = max{n,p 2 }, so that typ- 
ically g = n, because n ordinarily exceeds p 2 . 

4.1 Priors for Submodels 

We now review some techniques for prior speci- 
fication under a generic linear submodel. Let Ai k 
represent a submodel that uses p k predictors with 
p k < p. Write X = (X k :X\ k ), where X k is an (nxp k ) 



matrix. We assume that each submodel includes the 
intercept term, so that the first column of X k is the 
unit vector; for this reason there exist 2 P_1 possible 
models. Let /3 T = {P k ,P^ k ) be the partition corre- 
sponding to that of X. 

If we adopt interpretation S-N of nested models, 
we can write M k as y = X k f3 k + e, which is equiv- 
alent to the hypothesis H k : f3\ k = 0. On the other 
hand if one follows interpretation W-N, Ai k can be 
expressed as 

(10) y = X k Pl+e k , 

with e k ~ N(0, er 2 .!), and /3£ a p^-dimensional vector. 
Notice that in this setting each submodel presents 
a specific parametric representation, with a distinct 
and a 2 . To simplify the exposition, in the fol- 
lowing we will make use exclusively of representa- 
tion (10) which reduces to the S-N case by setting 
PI = (3 k and a\ = a 2 . 

It is common practice to "replicate" the gNIGa 
prior described in (7), under each M k , in particular 
using the same values of g, d and a. We will show 
that the UC and JC procedures, as well as KL based 
on a conjugate approximation, lead instead to 

(11) 

= NIGa($ , a 2 k ; b%, g k (X^X k ) ~ 1 , d k , a k ) , 

with model-specific hyperparameters. As a conse- 
quence, the marginal distribution of y is an n-dimensio- 
nal Student t-distribution and the Bayes factor for 
model M k versus model A4 S can be written as 



B ks 



C ks { a s + 9s y T M s y 
1 + 9s 



+ -r^—(y-J 



(y-XsK) 



(12) 



{d s +n)/2 



ak + 9k y T M k y 
1 + 9k 

+ — L-(y - X k b* k ) T 
1 +9k 



(v-XhbZ 



(d k +n)/2- 



where 



Cks 



T(d s /2)(a k ) d ^ 2 T((d k + n)/2)(l + g s )^/ 2 
T(d k /2)(a s r°/2T((d s + n)/2)(l + g k )v^ ' 
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with M k = I- X k {XlX k y l Xl =I-P k , where P k 
is the projection matrix onto the column space of 
X k . Accordingly y T M k y represents the residual sum 
of squares SSR k of model M k and similarly for M s . 

Notice that the marginalization procedure does 
not lead to the gNIGa prior (11). Indeed, condition- 
ally on <j k , the variance matrix of P k is given by 
ga 2 k [(X T ■ X)-% k , where [(X T X)-% k is the sub- 
matrix of (X T X)~ l containing the first k rows and 
k columns, which is not equal to {X^X k )~ l . This 
reason, together with the lack of invariance and of 
nuisance-coherence of the marginalization procedure 
in this case, suggest to disregard it in our future in- 
vestigations. 

4.1.1 Standard Approach. The conventional prior 
that is used in most Bayesian analyses of linear mod- 
els assumes that, under A4 k , (f3 k ,o~ k ) follows a gNIGa 
distribution, with hyperparameters (b k ,g,d,a), where 
the superscript S stands for "standard." Often the 
prior on a\ is taken to be improper (d — > and 
a — > 0) and the resulting prior will be denoted with 
7r I (/3^ , a k ), where / stands for "improper." Standard 
choices for b k reproduce the default options (9) and 
can be formally recovered as b k = (X k X k )~ l X k T Xb. 
Using results in Rao and Toutemburg (1999, pages 
41-42), it can be checked that when b = b the corre- 
sponding b k will coincide with the OLS estimate of 
(5 k under M. k . 

We conclude this section remarking that the stan- 
dard approach does not satisfy nuisance-coherence 
(it is enough to check that the marginal variance of 
y under M. k differs from that under IM k ); on the 
other hand nested-coherence trivially holds. 

4.1.2 Usual Conditioning. The prior for (f3 k ,o~ k ) 
in this case is given by 

^ C (^ k ,a 2 k ) = 7r(P* k ,a 2 k \f3\ k =0) 
(13) = ntft\P\ k = 0,4Ma 2 k \(3\ k = 0) 

It can be checked that the UC prior is gNIGa, that 



is. 



(14) 



with 



n k yPk^k) 

AJCtvT 



g^(XiX k )-\d 



1 jUC „UC 



a,. 



b v k c 



b k + {xlx k y\xlx Xk )b Xk , 



(15) 
(16) 



9 r 

a V k° 



dr 



d+(p-p k ) 



a + b{ k X? k M k X\ k b\ k . 



Analogous results were derived in Poirier (1985). 
Notice that under UC the hyperparameters change 
across models. In particular d k increases as p k de- 
creases (the model becomes smaller). George and 
McCulloch (1997) also allow different priors for the 
variance under the various models, although their 
choice is not based on formal probabilistic deriva- 
tions. In their case, the larger the model, the smaller 
the expected variance, which is not necessarily the 
case under UC. Notice that if b\ k = 0, one obtains 
= a /(d k c — 2), which decreases asp k decreases. 
While this feature may appear somewhat counterin- 
tuitive, it will turn out to have useful implications 
as detailed in Section 4.2. 

4.1.3 Kullback-Leibler Projection. The following 
lemma is instrumental in deriving KL-projections. 

Lemma 3. Consider the linear model A4 defined 
in (6), and the submodel M k defined in (10). Then 

(i) the KL-divergence between A4 and Ai k is 
given by 



KL{M,Mi 



1 



(XP - X k p* k ) T (X(3 - X k p* k ) 



+ 



\° 2 1 






— - log 


(5) 


- 1 









(ii) 



(17) Mgmm KL(M,M k ) = fa = (Xi X k )~ l X J k X/3; 



(iii) arg min^ ^ KL(M , M k ) = (fa , of 1 ) 
where fa is defined in (17) and 



(18) 
with 

(19) 



o- 2 + Q k W), 



Q k (P) = -faX T M k Xp 
n 



-P\k X \k M k X \k(^\k- 



Proof. Point (i) follows specializing to our case 
the KL-divergence between two multivariate normal 
distributions, given for example in Whittaker (1990, 
page 387). Points (ii) and (iii) are obtained by a 
direct calculation. □ 
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We now distinguish two cases, namely projection 
with respect to (3 k for given a k , and projection with 
respect to both f3 k and a\. Consider the former case. 
This is appropriate, for instance, when we want to 
take the same prior on a 2 for all models; in this 
case we need only minimize KL(M,A4 k ) with re- 
spect to P k and thus {3 k is given by (17) (for in- 
teresting related results, obtained using a predictive 
point of view, see Ibrahim, 1997, and Celeux, Marin 
and Robert, 2006). 

Proposition 4. Consider the linear model M 
specified in (6) with a NIGa(6, g(X T X)~ l , d,a) prior 
on {f3, a 2 ) described in (7)-(8) and a submodel hA k 
specified in (10). Conditionally on the assumption 
that Ot has the same distribution as a 2 , that is, 
lG&(d/2,a/2), the KL-prior on (P^a 2 .) is given by 

(20) NIGa($ , a 2 k ; bf L , g(X^X k )-\ d, a), 
with 
(21) 



b k + (xix k )-\xix Xk )b Xk 



-T - 



where {b k ,b^ k ) is the decomposition of b = E(j3) 
corresponding to Ai k . 

Proof. Recalling that f3u is a linear transfor- 
mation of (3, it follows immediately that the distri- 
bution of f3 k given a 2 is normal. Now E{(3 k L \a 2 ) = 
{XlX k )-^X T k XE(P) = {X T k X k )^XlXb, and (21) 
follows immediately rewriting X = (X k :X\ k ) and 
b T = (bl, b^ k ). Furthermore, \ai{^\a 2 k ) = ga\{Xl x 

Xk)- 1 Wfc, where W k = X T k P X k {X T k X k )~ x with P = 
X(X T X)- 1 X T . Let now M\ k = (I — P\ k ), where 
P\ k denote the projection matrix onto the column 
space of X\ k . Using the equality P = I — M\ k + 
M\ fe X fe (XjM\ fc X fc ) _1 XjM\ fe provided in Searle 
(1982, exercise 8, page 269), it follows that W k = I, 
which gives the result. □ 

Consider now the projection with respect to (P k ,cr k ) 
whose corresponding expressions are provided in 
point (iii) of Lemma 3. Notice that f3 k is unchanged 
relative to the previous case; on the other hand af. 1 - > 
a 2 since Q k {fi) > 0. [This follows because Q k (f3) can 
be written as w T w with w = M k X{3, using the fact 
that M k is a projection matrix.] As a consequence 
the KL-projection variance under A4 k will always 
exceed a 2 , justifying the intuition that the variance 
under M. k should be larger to account for a greater 
lack of fit. This case generalizes the simple linear 
regression example introduced shortly before Sec- 
tion 3.1. 



The KL-prior of (/3 k ,cr k ), that is, that induced 
from (7) on (f3 k , p^), is unfortunately not analyti- 
cally available, because of the awkward dependence 
of a^ 1 - on (/?, a 2 ). Of course one can easily simu- 
late from the KL-prior on (fit,o~i) using draws from 
the gNIGa prior on (f3,a 2 ) and mapping them into 
draws from 71^ through (f3 k , o~ k ' L ). However, we will 
not follow this course of action and derive an ana- 
lytical approximation along the lines described in 
Consonni, Gutierrez- Peha and Veronese (2007). Es- 
sentially, we employ a conjugate prior that mini- 
mizes the KL-divergence relative to the true vr^ L . 
We call the resulting prior the KL-conjugate ap- 
proximation, but for simplicity, we still identify it as 
7Tj[5 L . Specifically, we approximate the true KL-prior 
within the conjugate gNIGa family, whose hyperpa- 
rameters &? L ,g? L ) a? L ,(i? L are given in the follow- 
ing proposition. 

Proposition 5. Consider the linear model M 
specified in (6) with a NIGa(6, g(X T X) -1 , d,a) prior 
on ((3, a 2 ) described in (7)-(8) and a submodel A4 k 
specified in (10). Then the KL-conjugate approxima- 
tion prior on (f3* k ,a 2 k ) is the NIGa(^ L ,^ L (X^X fc )- 1 , 
d^ L ,a^ L ) where the hyperparameters can be identi- 
fied in the following way: 

• If b\ k = 0, they are the solutions of the following 
system of equations: 



(22) 
(23) 

(24) 



bf L = b k , 



9f L 



gE(R k ((3,a 2 )), 



4 L - 



1 



dE[R k {(3,a 2 )} 



iP(df L /2) - log(4 L /2) = Hd/2) - log(d/2) 



tKL 



(25) 



+ E{log[R k (/3,a 2 )]} 
-\og{E[R k {P,a 2 )}}, 



where R k {j3,a 2 ) = {l + Qk(P)/<r ) , and if) (a) = 
JMog(r(a)) is the digamma function. 
Ifb\ k ^ 0, they are approximately the solutions of 
the following system of equations: 



(26) 




(27) 




(28) 




and 





-If vT 



2W1 



E[R k (P,a 
d^ d E[R k {P,a 2 )^] 



V(4V2) - log« L /2) 



KL 



COMPATIBLE PRIORS FOR LINEAR MODELS 



11 



(29) 



i>{d/2) - log(d/2) 
+ 2 E[R k (/3,a 2 )-^- 



The analytical expressions for E{Rk(/3, a 2 )], E{Rk(f3, 
er 2 )^ 1 ] and Var[i?fc(/3, u 2 ) -1 ] are given in Lemma 
A.l in the Appendix. 

Proof. See the Appendix. □ 

Notice that both the expressions of in Propo- 
sitions 4 and 5 coincide with that of b^ c . Further- 
more, (22)-(25), as well as (26)-(29), do not admit 
a closed-form solution. Yet, a few results can be es- 
tablished which we report without proof: d^ L < d; 
d% L /d ->■ for d ->■ oo; af L — > for d ->■ oo, whence 



,KL 



< a for large d; E(a k 2 ) = d% L /af L < d/a 



E(a~ 2 ), as expected. Finally nested-coherence is sat- 
isfied on the space or regression parameter, while it 
fails on the variance space. Moreover it can be es- 
tablished empirically that nuisance-coherence fails. 

4.2 Information Paradox 

A major objection to the use of g-priors falls un- 
der the heading of Information Paradox; see Liang et 
al. (2008) for a recent discussion. Suppose that the 
regression model Mk is compared with the "Null" 
model Mq having no predictors. Assume the data 
overwhelmingly support Mk, that is, ||/3fc|| 2 = 0£pk — 
00, so that the coefficient R 2 under Mk tends to 1 
and SSRk = y T Mky — > 0. Using a g-prior under both 
models with zero expectation for the regression pa- 
rameters and dk = d, ak = a and gk = g, the Bayes 
factor Bko of M. k against A4o remains bounded 
whereas one would expect it to diverge. However, 
the paradox does not necessarily arise if we assume 
different g-priors under the two models as implied 
by the UC and KL-procedures, as we now show. 

First notice that /3j/3fc — > 00 implies also y T y —> 
00. If b = E((3) is independent of the data, for ex- 
ample; 60 in (9), it can be easily checked using (12) 
that Bko is asymptotic to 



(y T Ay) 



(d +n)/2 



(i/(i + 9fc )y T 2/) (dfc+n)/2 



with A ■ 



fjo 



n(l + g Q 



■J 



where I is the identity matrix and J is the matrix 
with all elements equal to 1. Since A m j n < y T Ay/ 
y T y < A max , where A m i n and A max are the smallest 
and largest eigenvalues of A, it follows that y 1 Ay = 



0{y T y) since A m i n > 0. As a consequence the limit- 
ing behavior of B k o depends on the hyperparame- 
ters do and dk deduced from the specific compati- 
ble procedure. In the case of UC, we have g k c = g 
and d^ c = d + (p - p k ) < d + (p - 1) = d^ c and thus 
—7- 00 so that the paradox does not arise. How- 
ever, this result does not hold for the KL-procedure, 
since > d^ L . The same conclusions can be ob- 
tained, using similar arguments, if we assume E(f3) = 
6 = (y,0,...,0). 

Suppose now E{j3) = 6, that is, the expectation of 
j5 is fully data-dependent. In this case both b^ c and 
reduce to the OLS estimate of f} k under A4 k , 
that is, ftUC = b KL = (xlX k )- l Xly, while b^ c = 
&o = V- Thus, from (12), Bko is asymptotic to 



a 



AO" 



(a + y T M y)^ + -)/ 2 



(d k +n)/2 



(30) 



with Mi 







-J 



n 



Under the UC procedure, only the hyperparameter 
a^ c can depend on the data y through b [see (15) 
and (16)], and we have 

a\ JC = a + y T MkX^(X^MkX\ k y 1 X^M k y 

(31) =a + y T (P-P k )y 

= a + y T (M k - M)y^a, 

recalling that ft™ 3 = P\ k = (X^MkX^)' 1 X^ k M k y, 
and using formula 3.98 on page 42 and Theorem 
A. 45 on page 367 in Rao and Toutemburg (1999). 
The result follows noting that y T (Mk — M)y —¥ 
because the SSR of A4 must be less than that of 
Mk which tends to zero by hypothesis. Thus Bko in 
(30) trivially goes to infinity, since Cko constant 
and y T M$y — ¥ 00, and there is no paradox. 

Under the KL-procedure instead, from (25), (58) 
and (59), it appears that the dependence of the hy- 
perparameters on the data happens only through 
Q k 0). Now 

Qk0) = ~P T X T M k Xp = -y T PM k Py 
n n 

= -y T (P - P k )y = -y T (M k - M)y 
n n 

which tends to zero as in (31). Accordingly the hy- 
perparameters behave as constants in the limit, and 
thus also in this case the information paradox does 
not arise. 
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5. EXAMPLES 

In this section we present three examples in order 
to evaluate the performance of the various priors 
discussed in Section 4.1. The first one considers the 
very simple situation of testing a normal model with 
a submodel Ai* having mean zero: in this way dif- 
ferent priors of a* 2 can be more easily compared. 
Features of the priors, and their consequences on 
variable selection, are then assessed in a more com- 
plex simulation study, and in a real data set (Hald 
data), frequently analyzed in the literature. 

5.1 A Simple Illustration 

Consider the two models 

M: yi = n + £i, £i~N(0,o- 2 ), 

M*: yi = e*, £ *il d N(0,a* 2 ), 

with i = 1, . . . , n, and assume as a prior for (/x,cx 2 ) 
under Ai the following gNIGa: 7r(/i|cr 2 ) = N(/x; 0, go 2 / 
n); vt(cj 2 ) = IGa(o- 2 ; d/2, a/2). If St n (-;rj,A,u) de- 
notes an n-dimensional Student t-distribution with 
expectation rj, degrees of freedom v and variance 
matrix uA~ l /(u — 2), v > 2, the marginal density of 
V is 

f(y)=St n LoMl-^j\d\ 

The submodel Ai* only requires a prior on a* 2 . 
The Standard, UC and KL-procedures lead to pri- 
ors 7r s , 7r uc and 7r KL for a* 2 which are all of type 
IGa(d*/2, a*/2). Specifically, one obtains 

(d s = d, a s = a); 

(32) 

(d uc =d+l, a uc =a). 

We consider also the typical improper prior on a 2 
given by it 1 (a 2 ) oc a~ 2 which can be formally ob- 
tained from 7r s setting d = 0, a = 0. Consider now 
the KL-prior. A direct computation yields a 2± = 
a 2 + /U 2 , which can also be deduced from (18) by set- 
ting Pk equal to the zero matrix since in this case 
is void, so that Qk(lj) = A* 2 - The values of d KL and 
a KL can be recovered from (24) and (25). For illus- 
tration, in the following we use three different values 
of (d, a), namely (d = 1, a = 1), (d = 5, a = 1), (d = 
3, a = 25) leading respectively to (d KL = 0.93, a KL = 
1.42), (d KL = 3.38, a KL = 1.03), (d KL = 2.36, a KL = 
29.98). 

In order to appreciate the effect of the different 
priors, we compute the posterior probability of the 



two models Ai and Ai* . In particular assuming prior 
odds 1, we have Pr(A%) = 1/(1 + B*), where B* = 
f*(y)/f(y) is the Bayes factor of Ai* versus Ai. 
Notice that f*(y) = St n (y;0,(d* /a*)I,d*) with d* 
and a* depending on the specific procedure. We fix 
n = g = 25 and perform a simulation study, generat- 
ing a vector e from a multivariate standard normal 
distribution, and set y = [ii n + e, where i n is the 
n-dimensional unit vector. In Figures 1 and 2 the 
posterior probability of Ai is plotted ELS £1 function 
of /i. Notice that the minimum of the curves does 
not occur at fi = 0, because the generated errors in 
the simulation had a negative mean of about —0.5. 
Ideally the posterior probability curve should reach 
a minimum close to zero for [i ~ and then increase 
rapidly as [i moves away from zero. When d = a all 
curves overlap to a large extent. Differences emerge 
for unequal a and d with the curves correspond- 
ing to 7T 1 and 7r s occupying intermediate positions, 
while those associated to 7r KL and 7r uc represent 
"extreme" curves. A strong sensitivity of 7r uc and 
7T KL is apparent and in particular when a is greater 
than d, 7r uc favors M* most strongly, while 7r KL fa- 
vors Ai (and conversely when d is greater than a). 
For a > d, the curve corresponding to 7r s is some- 
what flatter than that under 7T 1 . 

We now consider the problem of model compari- 
son from a predictive viewpoint as described in 
Gelfand and Ghosh (1998); see also Marriot, Spencer 
and Pettitt (2001). In the simple case corresponding 
to squared error loss, each model Aik is assigned a 
score made up of two parts: an error sum of 

squares component and a predictive variance 

component PW, 

(33) = -^—G {k) + P (fc) , c>0, 

c + 1 

where 

n 
i=l 

i=l 

l*\ k) = EW(y itiep \y), 
a^=V^ k \y iirep \y). 

In the above setting y T = (yi, . . . , y n ) are the data, 
while y^rep represents a future replicate observation 
(the number of replicates being equal to that of the 
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data). Model selection is achieved through a mini- 
mization of for a given choice of c. The term 
p( k ) represents a penalty which aims at discourag- 
ing models that either strongly underfit or overfit 
the data, because in both cases predictive variances 
will tend to be inflated. Since our objective is to 
compare the performances of the various priors un- 
der model M* we simply need to evaluate D* for 
each distinct prior. 



Consider first /i*. This is 

fi* = E*(y hIcp \y)=E*[E*(y^ cp \y,a 2 *)\y] 

= E*[E*(y i}rep \a 2 *)\y} = 0, 

since under Ai* each observation has expectation 
zero, conditionally on a 2 *. As a consequence D* = 
P* + X2fc=iJ/?j an d thus only the term P* matters 
for comparison purposes. Now 

of = Var*(y i>rep |y) = E*[Vnr*( yi , Tep \y, a 2 *)\y] 





i- 






\q.8 

\ \ \\ 
'■ \ \\ 
■■ \ 4 
y \ \ 


/ /-' 

// / ' 
h 
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Fig. 2. Posterior probability of M for hyperparameters d = 3, a — 25/ p Klj {M\y) dash thick, p s (M\y) solid thin, p vc (M\y) 
dash thin, p I (M\y) solid thick. 
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E* 



2* 



cr 



Iv) 



d* 



2' 



£ - 2 > 0, 



since under each prior the posterior distribution of 
af is IGa(d*/2, a*/2), with d* n = d* + n, and a* = 
a* + X^iLi^- I n conclusion the predictive criterion 
of Gelfand and Ghosh (1998) suggests to base model 
comparison on P* = na^/ (d* n — 2). 

From (32), it is immediate to conclude that _P UC < 
P s so that 7r uc supports M* more than 7r s . On the 
other hand, since d KL < d it follows that P KL > P s 
whenever a KL > a s (calculations show that this oc- 
curs for moderate values of d, specifically d < 5.45); 
in other words the KL-prior would tend to favor 
Ai* less than 7r s . These conclusions are broadly in 
accord with the curves describing P(Ai\y) depicted 
in Figures 1 and 2. 

5.2 Simulation Study 

As a second example, we consider a simulation 
study along the lines presented in George and Mc- 
Culloch (1993), Raftery, Madigan and Hoeting (1997) 
and Fernandez, Ley and Steel (2001). We consider 
p = 6 predictors, the constant plus (X\, . . . , Xq) and 
n = 30 observations. Let Zj, j = 1, . . . , 5 be indepen- 
dent n-dimensional vectors, whose components are 
independent standard normal variables, and set 

Xi = Zi, X2 = Z2, X 3 = Z 3 , 
(X 4 ,X 5 ) = (X 1 ,X 2 )(0.3 0.7) T (1 1) + (Zi,Z 5 ). 

In this way there is a correlation between the first 
two predictors and the last two. We generate the 
response y according to three different models: 



(34) Mi 

(35) M 2 

(36) M 3 



y = C + 2.5e, 

y = C + 2X 1 -X 3 + 1.5X 5 + 2.5e, 

y = C + 2X 1 -X 3 + X 4 + 1.5*5 + 2.5e, 



where C is a fixed constant and the n elements 
of e are independent standard normal variables. In 
particular, the case in which the data were gener- 
ated from Ai\ was analyzed in a frequentist way by 
Freedman (1983). He showed that, under this "null 
model," standard variable selection procedures, such 
as stepwise regression, may lead to misleading re- 
sults, for example, retaining a subset of predictors 
with a highly significant -F-statistic and reasonably 
high R 2 . 

In order to compare the different priors, we con- 
sider the Bayes factor for each submodel versus the 



full model with six predictors (including the con- 
stant) for 50 simulated data sets and report the fre- 
quency of times in which the highest Bayes factor 
is associated to the correct model (i.e., the model 
which has generated the data) . We fixg = )i and for 
each choice of E(/3), namely bo,b,b [see (9)] check 
the robustness of the various priors to the choice 
of the hyperparameters (d, a) of the inverse-gamma 
distribution on a 2 (each time leaving unchanged the 
values of the predictors). 

We can summarize our results, which are in part 
reported in Table 1, as follows: 

(i) 7r uc appears to be the least robust prior rel- 
ative to the various choices of E(/3) and (d,a); this 
is consistent with the fact that the marginal of the 
data under 7r uc is more peaked on its expectation; 
see the discussion in Section 5.1. Its frequency of 
correct model identification can reach very low val- 
ues especially when d exceeds a, in accord with the 
fact that as d increases relative to a larger mod- 
els receive greater support under 7r uc ; see Figure 
1. To provide an explanation of this phenomenon, 
consider the Bayes factor B^ of the submodel Aik 
versus the full model Ai. If the prior under Aik is 
obtained through UC, then calculations show that 



(37) 



Bi 



TT 



W\k = 0\y) 



TT 



(/V = o) ' 



where 7r(/3u = 0\y) and n(f3\k = 0) are respectively 
the marginal posterior and prior density of /3\f., eval- 
uated at the value 0. The expression (37) for B k is 
known as "Savage's density ratio"; see, for exam- 
ple, O'Hagan and Forster (2004, Section 7.16). Now 
if the data are at least moderately more informa- 
tive than the prior, the numerator will be essen- 
tially dominated by the likelihood, and thus will be 
fairly robust to prior specifications, while this does 
not clearly occur for the denominator. In particular, 
if d increases relative to a, the distribution of a 2 
tends to concentrate on smaller values, so that the 
marginal of /3\^ becomes more peaked around the 
mode (which coincides with under bo or b), thus 
lowering B^, and supporting Ai more than Mk- 

(ii) 7r KL is reasonably robust and shows good per- 
formance, save when the generating model corre- 
sponds to the "null model" M.\ and a is large (this 
is in accord with the fact exhibited in Figure 2 that 
for large a bigger models are preferred under 7r KL ). 

(iii) 7r s and n l exhibit a relatively similar behav- 
ior, as already remarked in the previous section, and 
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Table 1 

Frequency of correct identification of the true model Mi (i — 1, 2, 3) with g — n = 30 for various compatible priors and 

different choices of (d, a) and E(/3) 



d 




a 




7T 












7T 






7T 1 




b 


b 


b 


bo 


b 


b 


bo 


b 


b 


bo 


b 


b 


K A 

Mi 


true model 




























U 
























0.60 


0.00 


n k a 
0.54 


1 




1 


0.24 


0.40 


0.24 


0.56 


0.54 


0.52 








0.76 








1 




10 


0.08 


0.24 


0.48 


0.64 


0.56 


0.56 


0.32 


0.26 


0.64 








5 




5 


0.26 


0.44 


0.24 


0.50 


0.50 


0.48 


0.06 


0.06 


0.86 








10 




l 


0.34 


0.48 


0.30 


0.40 


0.36 


0.36 








0.96 








10 




50 


0.04 


0.06 





0.56 


0.54 


0.52 


0.46 


0.42 


0.60 








* A 

Ml 


true model 






















































0.46 


O.OD 


0.60 


1 




1 


0.70 


0.60 


0.66 


0.48 


0.58 


0.60 


0.68 


0.68 


0.32 








1 




10 


0.70 


0.66 


0.64 


0.42 


0.54 


0.58 


0.58 


0.62 


0.56 








5 




5 


0.68 


0.60 


0.68 


0.58 


0.60 


0.62 


0.64 


0.64 


0.18 








10 




1 


0.66 


0.52 


0.60 


0.62 


0.64 


0.66 








0.02 








10 




50 


0.66 


0.68 


0.68 


0.50 


0.58 


0.60 


0.58 


0.62 


0.58 








Ms 


true model 






















































0.26 


0.38 


0.54 


1 




1 


0.64 


0.44 


0.66 


0.26 


0.42 


0.54 


0.68 


0.68 


0.26 








1 




10 


0.74 


0.54 


0.52 


0.24 


0.36 


0.05 


0.30 


0.42 


0.54 








5 




5 


0.54 


0.32 


0.50 


0.40 


0.54 


0.56 


0.64 


0.64 


0.04 








10 




1 


0.22 


0.16 


0.66 


0.56 


0.56 


0.60 

















10 




50 


0.74 


0.56 


0.60 


0.34 


0.48 


0.54 


0.50 


0.54 


0.52 









have a better performance than the other priors at 
identifying the "null model." 

Overall, the frequency of correct model identifi- 
cation is comparable, or even superior, to similar 
investigations carried out in a Bayesian framework, 
although using different model choice criteria and 
different priors; see Marriot, Spencer and Pettitt 
(2001). 

5.3 Hald Data 

Our third example involves the Hald data, often 
analyzed in the literature, in order to evaluate model 
selection procedures; see, for instance, Draper and 
Smith (1981). It consists of 13 observations on one 
response variable with four predictors. A specific 
feature of this data set is represented by the strong 
correlation between X\ and X% and between X2 and 
X4. We consider all the possible 16 models in which 
the constant term is always included. 

A detailed subjective Bayesian analysis of this data 
set has been performed in Laud and Ibrahim (1995, 
1996) and Ibrahim (1997), especially in terms of 
prior specification. We follow Laud and Ibrahim (199£ 



and fix a prior on (/3, a 2 ) under the full model which 
is a NIGa(6,5(X T X)~ 1 ,25,125) with E{f3) = b = 
(X T X)~ 1 X T ri, where 17 is a subjective prediction 
for y given by 77 = (79, 77, 104, 90, 99, 108, 105, 73, 93, 
111, 88, 115, 113). We also report the value 7 = 1/(5+ 
1), which represents a weight on the prior guess rj. 
Notice that the choice of d = 25 and a = 125 implies 
E(a~ 2 ) = 0.2 and Pr(cj- 2 < 0.5) » 0.95. 

Table 2 summarizes the results of a Bayesian anal- 
ysis using the conventional value g = n = 13, as well 
as g = 9 (Ibrahim's choice) which correspond to 
weights 7 = 0.07, respectively 0.10, representing weak 
prior information. Moreover we consider two choices 
for E(j3), namely b and b. We do not report explic- 
itly results for E{0) = 60 because posterior model 
probabilities are relatively more diffuse and no sub- 
set of models emerges as a clear winner. The column 
7r Ibr reports the results obtained in Ibrahim (1997) 
which assumes a fixed a~ 2 = 0.2. The highest prob- 
ability is given to model {1, 2} under all priors, save 
for 7r KL that indicates a slight preference for more 
complex models, for example, {1,2,4} for g = 13. 
Overall there is broad agreement with standard fre- 
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quentist model selection procedures as reported in 
Laud and Ibrahim (1995, Table 1). 

We also performed a sensitivity analysis (not re- 
ported here) with respect to 7 (0.01 < 7 < 0.95) for 
the two choices E(f3) = bo, respectively 6, in order 
to make a comparison with the results of Tables 2 
and 3 of Ibrahim (1997). The results are appreciably 
sensitive to the choice of 60 or b, although this fact 
is definitely less manifest for the prior 7r Ibr (under 
which, however, a 2 is assumed fixed). Overall it is 
confirmed that the choice of &o is the least satisfac- 
tory, as it tends to shift posterior model probability 
toward "extreme" models, such as the null or full 
model, when 7 approaches either boundary. On the 
other hand, under b the results are fairly insensitive 
to the choice of 7 as far as the identification of the 
top model is concerned, which is usually {1,2}, and 
either {1, 2, 3} or {1,2, 4}. In particular 7r KL exhibits 
a high stability, with respect to 7, of the posterior 
probability mass on the top model which always con- 
tains three predictors. 

The Hald data have been also analyzed in a Bayesian 
objective framework, in particular by Berger and 
Pericchi (1996b) using intrinsic Bayes factor, and by 
Casella and Moreno (2006) and Moreno and Giron 
(2007) using intrinsic priors. The models they iden- 
tify are essentially those exhibited as most probable 
in Table 2. However, under their approach, model 
{1,2} receives a posterior probability in excess of 



50%. Based on an objective predictive approach, 
Barbieri and Berger (2004) develop a theory for model 
choice. They show that the optimal model is not nec- 
essarily the highest posterior probability model, but 
rather the "median probability model." For the Hald 
data the latter is represented by {1,2,4} which, cu- 
riously, is also the model with the highest posterior 
probability under the KL-prior with g = n; see Ta- 
ble 2. 

6. DISCUSSION 

For a given proper prior on the parameter space of 
a full model, we reviewed and analyzed procedures 
for the specification of prior distributions on the pa- 
rameter space of a collection of submodels. We pre- 
sented two interpretations of nested models, in or- 
der to explicate more naturally the rationale of each 
procedure. In particular, we investigated four meth- 
ods for the specification of a compatible prior under 
a submodel, namely marginalization, usual and Jef- 
freys conditioning and Kullback-Leibler projection. 
Next, each procedure was evaluated from two per- 
spectives, nuisance- and nested-coherence. Given a 
full linear model with a normal inverted gamma g- 
prior on the parameters, we considered the problem 
of variable selection, and applied the above proce- 
dures for the construction of priors under each sub- 
model Adk- For completeness we also considered, for 
each VWfc, a g-prior on the regression parameters 



Table 2 

Posterior probability of top four models with g — n — 13 (7 = 0.07,) and g = 9 (7 = 0.1) for various compatible priors and 

different choices of E(/3); in first column is Ibrahim's results 



Model 



9= 13 

{1,2} 

{1,4} 

{1,2,3} 

{1,2,4} 

{1,3,4} 




0.175 

0.181 
0.184 
0.169 


0.203 

0.227 
0.234 
0.174 


0.340 

0.145 
0.151 
0.127 


0.290 

0.207 
0.220 
0.155 


0.276 

0.167 
0.174 
0.147 


0.293 

0.211 
0.223 
0.146 


0.329 
0.221 
0.112 
0.114 


0.271 

0.213 
0.229 
0.153 


Total 




0.709 


0.838 


0.763 


0.872 


0.764 


0.873 


0.776 


0.866 


.9 = 9 

{1,2} 

{1,4} 

{1,2,3} 

{1,2,4} 

{1,3,4} 


0.272 

0.215 
0.214 
0.164 


0.217 
0.171 
0.157 
0.156 


0.210 

0.230 
0.216 
0.173 


0.310 
0.165 
0.143 
0.143 


0.262 

0.215 
0.209 
0.163 


0.238 

0.171 
0.171 
0.153 


0.268 

0.222 
0.217 
0.157 


0.294 
0.219 
0.111 
0.111 


0.248 

0.219 
0.213 
0.159 


Total 


0.865 


0.701 


0.829 


0.761 


0.852 


0.733 


0.864 


0.735 


0.839 
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combined with an inverted gamma (d,a) distribu- 
tion on o"?, labeled 7r s , as well conventional 
improper prior on cr?, identified with tt 1 . 

Three examples were used to illustrate the behav- 
ior of the various procedures for prior specification, 
leading to the conclusions that results are quite sen- 
sitive to the choice of the hyperparameters. Over- 
all the improper prior tt 1 performs comparably to 
the standard prior 7r s , when d and a are similar. 
The usual conditioning prior 7r uc , despite its the- 
oretically attractive coherence properties exhibited 
in Propositions 1 and 2, shows remarkable sensitiv- 
ity to the choice of the hyperparameters, oscillating 
between highly simple and complicated models. The 
Kullback-Leibler projection prior exhibits a perfor- 
mance which is comparable or superior to that of tt s 
when using the OLS estimate as prior expectation 
on f3, provided that the true model is not very close 
to the "null" model with no predictors. This is con- 
sistent with the general attitude of the KL-prior to 
favor more complex models. 

When the goal of model choice is prediction, one 
might consider orthogonalizing the matrix of predic- 
tors, as in Clyde, DeSimone and Parmigiani (1996). 
In this case a (/-prior on the regression coefficient 
under the full model admits a diagonal variance ma- 
trix. As a consequence the M, UC and KL-procedures 
would generate the same prior under each submodel 
A4k conditionally on a\\ yet they would imply dis- 
tinct priors for the variance. We remark, however, 
that this approach cannot be implemented in a vari- 
able selection problem, where the focus is on the 
original predictors. 

Consistency of the posterior distribution on model 
space under different choices of the hyperparameter 
g* k in the gNIGa prior (11), with d^ = d and = a, 
has been recently discussed in Fernandez, Ley and 
Steel (2001). They prove, under mild conditions, 
that consistency obtains under both the standard 
and improper priors tt s and tt 1 . Using similar argu- 
ments one can prove that the same result holds for 
the UC procedure under bo and b, defined in (9). 
As far as 7r KL is concerned the limiting probabil- 
ity of model Mk is zero provided the true model 
is not nested within Mk', on the other hand when 
Mk is moderately larger than the true model this 
result may fail, and tt kl may lead to choose slightly 
overparametrized models. 

It is well known that a standard use of (/-priors for 
variable selection cannot be recommended because 
it suffers from the information paradox. However, 



our analysis shows that, when g-priors under sub- 
models are derived using compatibility criteria, the 
paradox either does not arise (UC procedure), or 
can be avoided (KL-procedure) through a suitable 
choice of the initial hyperparameters. 

Recent contributions in the area of linear mod- 
els (see Liang et al., 2008 and Bayarri and Garcia- 
Donato, 2007), suggest to use a noninformative im- 
proper prior on the nuisance parameter and a proper 
mixture of g-priors on the regression coefficients. It 
would be interesting to apply the methods discussed 
in this paper to the latter distribution of the re- 
gression coefficients in order to derive a compatible 
mixture of g-priors under the various submodels. 

APPENDIX 

Proof of Proposition 1. Assume that the 
sampling distribution under model M is {/(y|A, 5, (/>)}, 
where 5 is the nuisance parameter. Then, for a given 
prior tt(X,5, </>), the integrated model XM has sam- 
pling distribution f(y\\,<p) = f f(y\X, 5, ^)tt(S\X, 
4>) d8, while the corresponding integrated submodel 
XM* has density f*(y\X) = f(y\X,4> = 4> ). Let the 
prior under XM. be ttxm (A </>) = k(X, 4>) , that is, the 
marginal distribution of (X,<p) under tt. Consider 
now a procedure to construct a prior under a sub- 
model. Let 



f* M *(y) 



(38) 



f*(y\X,5)n* M *(X,8)dXd6 
f(y\KS,cl) = (j)o)ir* M *(X,S) dXd5 



and 



fxM* (v) 

t(y\X)ir XM *(X)dX 



(39) 



f{y\X, S,(f> = cpo)n(5\X, (J) = 4> ) d5 



TT- 



XM 



(A) dX, 



where 71"^,, (A, 5) is the output of the procedure ap- 
plied to (M, M*) starting from 7r(A, 5, 4>), while tt X m* 
is the output of the procedure applied to (IM,XM*) 
starting from 7r(A, </>). 
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(i) Recall that (A, S) = 7r(A, 8\<f> = 0o) an d con- 
sider vr^c „. We have 7r^,.(A) = 7rj^(A|(^ = O ) = 
7r(A|0 = 0o ). As a consequence we get from (38) 



(40) 



5, = o )tt(<5| A, = 0o ) d<5 



id* 



• 7r(A|0 = o )dA, 
while from (39) we get 

/D&.(y) 

(41) =/{/ /(y|A,<5,0 = 0o)vr(5|A,0 = 

• 7r(A|0 = o )dA, 

and the two densities clearly coincide. 

(ii) Recall that 7r^*(A,<5) = 7r(A,<5). Consider now 
^XM*(X): this is the marginal of ttxm(X,8)', the lat- 
ter, however, coincides with the marginal 7r(A, 5) un- 
der the prior 7r(A, 5, 0) by definition of integrated 
model. We therefore obtain i^ M *{\) = tttm(X) — 
7r(A). From (38) we get 



/&(!/) 



/(y|A,<5,0 = o )7r(5|A)7r(A)d(5dA, 



while from (39) we get 



5, = 0o)vr(<5| A, = O ) d5 



■ir(\)d\. 

Inspection of fj^*(y) and f^ M *{y) reveals that if S 
is conditionally independent of given A, the two 
densities are equal, 
(hi) Recall that 



7r$,(A,<5)cx7r(A, 



3M*(\&) 



3m(\5, 0o)' 

where the j-functions are the Jeffreys priors. Passing 
to the integrated model we therefore obtain 

JXM* (A) 



7r^*(A)oc7rx7w(A|0 = o )^ 

JXA4(A,0o) 



Let 



HX,S)= iM : (X x i\ , g(X)- JXM * W 



3m(X,5, o ) : 



Jxx(A,0o)' 



Clearly, if h(\, S) oc g(X), then and f x %*(y) 

have a representation as in (40), respectively (41), 
with the integrand in each case multiplied by 5(A), 
and therefore they must coincide. □ 

Proof of Proposition 2. Start with the M 
procedure. Notice that 7r^(A) = vr(A). On the other 
hand 7r^»(A) =7r*(A), where 7r*(A) is the marginal 
prior on A under 7r*(A,02); but the latter is under 
M equal to 7r(A,02), whence 7r^»(A) =7r(A), thus 
establishing the result. 

Consider now the UC procedure. We have vr^(A) = 
7r(A|0i = 0?,02 = 0;])- On the other hand 



TT^(A) 



7T*(A|02 
7r(A,02 = 



7T* (A, 02 = 0^) 

= 4) 



7T 



0^|01 = 0l) 



7r(02 

= 7r(A|0l = <Pi,<P2 ■ 

which establishes the result. 

Finally consider the JC procedure. We have 

(42) vr^(A)ocvr(A|0 1 =0° - JM " W 
On the other hand 

(43) 7r^(A)avr^(A|02 



i(A,0?,0^) 



3M**{X) 
3M*(\4>1) 



where 7r^ /( (A|02 = 0^) is proportional to the JC prior 
under the A4* model, evaluated at (02 = 0°), namely 
vr. 



lA4 (A,0 2 = 02), where 

7rJU(A,02) OC7r(A,0 2 | 



^3M*{X,(j)2 



" i; j(A,0?,0^)' 
Substituting into (43), one obtains (42). □ 

Lemma A.l. Assume (J3,a 2 ) ~ NIGa(b,g(X T ■ 
X)-\d,a) and set R k {p,o 2 ) = (1 + Q k (f3)/a 2 y\ 
with Q k (j3) = (3 T X T (I - P k )X(3/n. Then, given a 2 , 
Q k ((3)/a 2 ~ {g/n) X 2 p _ k {5), with 5 = nQ k (b)/(ga 2 ), 
where Xp- k (°~) * s a chi-squared distribution with (p — 
k) degrees of freedom and noncentrality parameter 5. 
As a consequence 

E[R k {p,o- 2 T l ] 

(44) 



-(p-k) + Q k (b)- 
n a 



(45) 



Var[i? fc (/3,c7 2 ) 

= -Qk(b) 
a 



Qk(b) 



2g 
+ — 

n 
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Furthermore 



(i) */ b\ k = 0, then R k ((3,a 2 ) = [1 + [g/^W]- 1 
with W distributed as a (central) X„- k > whence 



(46) 



{ 2g 

n 



r i 



-(p-k)/2 



I n 

cxp U 



(i) If b\ k = 0, then 5 = 0, so that W = (n/g)Q k (/3)/ 
a 2 is distributed as a (central) xf,- k - Thus E[R k (f3, 
a 2 )] = E w [(l + (g/n)W)~ 1 ] whose analytical expres- 
sion is given in (46). 

(ii) If b\ k + 0, writing E[R k ((3,a 2 )} = E[l/R k (P, 
and recalling that the first-order approxima- 



te 



2\-n 



p — k n 



2gJ' 



where F(a,z) = exp(— t)t a ~ l dt is the incomplete 
gamma function. 

(ii) If b\ k ^ 0, then the first- order approximation 
of E[R k ([3,o~ 2 )] given by the delta method is 



E[R k ((3,a 2 )] 



1 



2\-ll 



(47) 



E[R k (P,a 2 ) 



l + %-k) + Q k (b)- 
n a 



tion gives E(l/W) ~ 1/{E(W)) for an arbitrary ran- 
dom variable W, we obtain (47). □ 

Proof of Proposition 5. The NIGa(6 fc , 
g k (X k r X k )~ 1 ,d k ,a k ) distribution on {(3 k ,o- k ) can be 
written as 

n(Pk,°k) <xexp\--^a + b T X k r X k ^ 

1 f3* k T Xl X k l3* k 
a 2 



+ 



2g u k 
d + p + 2 



0~l 



Proof. First of all notice that because X= thus it belongs to the exponential family with "canon- 



[X k :X\ k ] and (3 = [f3f:^ k ] T , we have Q k (f3) = 
f3 T X T M k Xp/n = ^ k X^ k M k X\ k f3\ k /n. Now f3\ k \a 2 
is distributed according to a N(6w., go- 2 T>\ k ) with 
E\ k = (X^ k M k X\ k )~ l (see Searle, 1982, Section 10.5), 
and consequently (n/g)Q k ((3)/a 2 given a 2 is dis- 
tributed according to a Xp- k ($) distribution, where 
p — k are the degrees of freedom and 5 = (n/g) x 
Qk(b)/o~ 2 is the noncentrality parameter (see Muir- 
head, 1982, page 26). Now recalling that the ex- 
pected value and variance of a Xp- k ($) distribution 
are respectively p — k + 5 and 2 (p — k) + 45, (44) 
follows immediately from E[R k (f3,a 2 )^ 1 ] = E a ' 2 {1 + 
E^ 2 [Q k {/3)/a 2 }}, and E{l/a 2 ) = d/a. 
Similarly (45) follows from 



ical statistics" given by 1/a 2 , f3*/a 2 , (3fX^X k ^/al 
and log(l/o"|). Applying Theorem 1 of Consonni, 
Gutierrez-Peha and Veronese (2007), it follows that 
the KL-divergence between 7r KL and a NlG&(b k , g k (X k ■ 
A^) _1 ,d k ,a k ) distribution is minimized for values 
&j^ L ,5fe >^fc and a^ L which are a solution of the 
following system: 



Var[R k (P,(T 



2\-li 



Var[Q fc (/3)/a 2 



Var ff 
+ E a ° 
4ivar^ 2 



HP) 



(48) 



(49) 



(50) 



(51) 



E^(l/a 2 ) 

= E^(l/a 2 ), 
E KL (P* k /a 2 ) 

=E™ G »(p* k /*D, 

E KL ((3fxlX k (3* k /a 2 ) 

= E™ G *(P* k T X k T X k p* k /4), 
E KL (log(l/a 2 )) 



E 



NIGa 



(log(V^)), 



Var' 



n 



+ E a 



p-k + ^Q k (b) 
go 1 

nQ k (b) 



where i? KL denotes expectation w.r.t. the KL-projec- 
tion prior induced by the NlG&(/3,a 2 ;b,g(X T X)~ 1 , 
d,a), while E mGa denotes expectation w.r.t. the 
NIGa(/?*, a 2 ; b^,gf L (X^X k )-\d^, af L ). Recalling 



r 2± 



2(p - Ar) + 4 



ga z 



and Var(l/cr 2 ) = 2d/a 2 . 



(17) and (18), that is, p£ = (X£ X^X^ Xp, a 
o~ 2 + QkW)i we can compute the terms involving 
E KL in the previous equations substituting (P^,a k ) 
with the corresponding expression of {P k ,o' k l - L ) and 
using the prior n(/3,a 2 ). 
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First of all recall that if Y is a normal vector 
with variance matrix E, then Y 7 AY and CY are 
stochastically independent if and only if CY*A = 0; 
similarly Y T AY and Y T DY are stochastically in- 
dependent if and only if AT,D = (with A, C and 
D being suitable matrices). It follows that under 
7r and given a 2 , Q k ((3) and /3jjr as well as Q k ((3) 

and (5 k T X k X k f3 k are independent; the latter im- 
plies that also f3 k T X k X k j3 k and are indepen- 
dent, given a 2 . The proof is a straightforward cal- 
culation. 

Consider now (49). The left-hand side is equal to 



E 



2-L 



<T 2 +Qk(P)\ 

■E^ 2 [(X T k X k )-'X T k XP) 

= {XlX k )- 1 XlXbE(l/al), 

while the right-hand side is equal to E m (f3 k /a k ) = 
bf L E mGa (l/al). Using (48) it follows that 

(52) bf L = (XlX^XlXb. 

Consider (50). First of all notice that, using (17), 
f3 k iT X T Xf3 k L = f3 T X T P k Xf3. Thus the left-hand side 
can be written, recalling the independence of a 2± 
and p£ X T X/3 k L , given a 2 , as 

E^ [E^ 2 (1 lat)E^ 2 {fX T P k Xfi)\ 

= E° 2 {E^ 2 [(l/a 2± ) 

■(ti(a 2 gP k P) + b T X T P k Xb)]} 

= E° 2 [{kga 2 + b T X T P k Xb)E^ 2 (l/a 2± )} 

= kgE[R k (f3,a 2 )]+b T X T P k XbE(l/a 2± ), 

where R k ((3,a 2 ) = [1 + Q k (P) / a 2 ]" 1 . 
The right-hand side is equal to 

E°l[(l/o 2 )E^l(pfx T k X k Pi)} 

= E°l{(l/a 2 )Mo 2 g^(X T k X k )-\xlX k )) 

+ b^ T {xlx k )b^]} 

= kgf L + b T X T P k XbE°l(l/a 2 ), 

substituting the expression of given in (52). 
Equating the left- and right-hand sides and using 
(48) we obtain 

(53) g« L = gE[R k (J3,o 2 )]. 



Consider (51). The left-hand side can be written 
as £[log(l/<7 2 )] + E[log(R k (f3,a 2 ))] with £[log(l/ 
a 2 )} = *(d/2) - Iog(o/2), where *(a) = & ■ 
log(r(a)) is the digamma function. The right-hand 
side is equal to *(df L /2) - log(a^ L /2) and thus we 
obtain 



(54) 



(d/2) - log(a/2) + E[log(R k ((3, a 2 )] 



*(4 L /2)-logK72). 



KL 



Assume now that b\ k = and consider last (48). 
First notice that the left-hand side can be written as 
E{R k ((3,a 2 )/a 2 ], while the right-hand side is equal 
to djp/ajp. Since, from Lemma A.l, R k ((3,a 2 ) is 
independent of a 2 when b\ k = 0, (48) becomes 

(55) E(l/o 3 )E[R k {p,o*)] = 
which implies 

(56) 



KL /KL 



d^/a 



4 L 



k dE[R k (P,a 2 )} 



Substituting (56) into (54) we obtain 

*(4 L /2)-log(4 L /2) 

(57) = *(d/2) - log(d/2) + E{\og[R k {p, a 2 )]} 

-\og{E[R k (f3,a 2 )}}. 

Consider now the case b\ k ^ 0. In order to obtain 
an explicit expression of (53), we use the approxi- 
mation of E[R k (f3, a 2 )] given in (47), so that 



(58) 



E[R k \(3,a 2 )] 



9 



[l + g/n(p-k) + Q k (b)d/a]' 

Furthermore, we can still use (55) as an approxi- 
mation of (48) to the first order. Thus we have 

a 1 



4 L 



dE[R k (f3,a 2 )} 



(59) 



df L - d E[R-\(3,a 2 )} 



d 



KL' 



d 



l + %-k) + Q k (b)- 
n a 



using (47). 

Using the first approximation of (59), formula (57) 
still holds in an approximate way. 
Finally (57) reduces to 

*(4 L /2)-log(4 L /2) 

lVar([i4(/3,cT 2 )]) 



(d/2) - log(d/2) 



2 E[R k ((3,a 



2YI2 
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using the further second-order approximation 
E[\og(U)) m]ag[E(U)] - (1/2) V&r(U)/[E(U)} 2 , for 
a positive random variable U. 

Since Var(C7) = Var(l/C/~ 1 ) » [l/^C/" 1 )] 4 • 
Var(C/- 1 ) and E(U) = Eil/U' 1 ) « l/^C/" 1 ) we 
conclude 



*(d^/2)-Iog((^/2) 
« * (d/2) - log(d/2) 



l Var([fi- 1 (/3,c7 2 )]) 
2 E[R-\p,a^ 



with ^[i?" 1 /?,^ 2 )] and Varli?- 1 ^,^ 2 )] given in (44) 
and (45). □ 
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